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Abstract 


The  structure-from-motion  problem  has  been  extensively  studied  in  the  field  of  computer 
vision.  Yet,  the  bulk  of  the  existing  work  assumes  that  the  scene  contains  only  a  single  moving 
object.  The  more  realistic  case  where  an  unknown  number  of  objects  move  in  the  scene  has 
received  little  attention,  especially  for  its  theoretical  treatment.  In  this  paper  we  present  a  new 
method  for  separating  and  recovering  the  motion  and  shape  of  multiple  independently  moving 
objects  in  a  sequence  of  images.  The  method  does  not  require  prior  knowledge  of  the  number 
of  objects,  nor  is  dependent  on  any  grouping  of  features  into  an  object  at  the  image  level. 
For  this  purpose,  we  introduce  a  mathematical  construct  of  object  shapes,  called  the  shape 
interaction  matrix,  which  is  invariant  to  both  the  object  motions  and  the  selection  of  coordinate 
systems.  This  invariant  structure  is  computable  solely  from  the  observed  trajectories  of  image 
features  without  grouping  them  into  individual  objects.  Once  the  matrix  is  computed,  it  allows 
for  segmenting  features  into  objects  by  the  process  of  transforming  it  into  a  canonical  form,  as 
well  as  recovering  the  shape  and  motion  of  each  object. 
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1  Introduction 

A  motion  image  sequence  allows  for  the  recovery  of  the  three-dimensional  structure  of  a  scene. 
While  a  large  amount  of  literature  exists  about  this  structure-from-motion  problem,  most  previous 
theoretical  work  is  based  on  the  assumption  that  only  a  single  motion  is  included  in  the  image 
sequence;  either  the  environment  is  static  and  the  observer  moves,  or  the  observer  is  static  and  only 
one  object  in  the  scene  is  moving.  More  difficult  and  less  studied  is  the  general  case  of  an  unknown 
number  of  objects  moving  independently.  Suppose  that  a  set  of  features  has  been  extracted  and 
tracked  in  an  image  sequence,  but  it  is  not  known  which  feature  belongs  to  which  object.  Given 
a  set  of  such  feature  trajectories,  the  question  is  whether  we  can  segment  and  recover  the  motion 
and  shape  of  multiple  objects  contained  in  the  image  sequence. 

The  previous  approaches  to  the  structure-from-motion  problem  for  multiple  objects  can  be 
grouped  into  two  classes:  image  motion-based  (2D)  and  three-dimensional  (3D)  modeling.  The 
image-motion  based  approach  relies  mostly  on  spatio-temporal  properties  of  an  image  sequence. 
For  example,  regions  corresponding  to  different  velocity  fields  are  extracted  by  using  Fourier 
domain  analysis  [18][1]  or  scale-space  and  space-time  filters  [3,  5,  8,  9].  These  image-based 
methods  have  limited  applicability  either  because  object  motions  are  restricted  to  a  certain  type, 
such  as  translation  only,  or  because  image-level  properties,  such  as  locality,  need  to  be  used  for 
segmentation  without  assuring  consistent  segmentation  into  3D  objects. 

To  overcome  these  limitations,  models  of  motion  and  scene  can  be  introduced  which  provide 
more  constraints.  Representative  constraints  include  rigidity  of  an  object  [17]  and  smoothness  (or 
similarity)  of  motion  [13,  1 1,  2,  4].  Then  the  problem  becomes  segmenting  image  events,  such  as 
feature  trajectories,  into  objects  so  that  the  recovered  motion  and  shape  satisfy  those  constraints. 
It  is  now  a  clustering  problem  with  constraints  derived  from  a  physical  model.  Though  sound  in 
theory,  the  practical  difficulty  is  the  cyclic  dilemma:  to  check  the  constraints  it  is  necessary  to 
segment  features  and  to  segment  it  is  necessary  to  compute  constraints.  So,  developed  methods 
tend  to  be  of  a  "generate-and-test"  nature,  or  require  prior  knowledge  of  the  number  of  objects 
(clusters).  Ullman  [17]  describes  a  computational  scheme  to  recursively  recover  shape  from  the 
tracks  of  image  features.  A  model  of  the  object’s  shape  is  matched  to  the  current  position  of  the 
features,  and  a  new  model  that  maximizes  rigidity  is  computed  to  update  the  shape.  He  suggests 
that  this  scheme  could  be  used  to  segment  multi-body  scenes  by  local  application  of  the  rigidity 
principle.  Since  a  single  rigid  body  model  does  not  fit  the  whole  data,  collections  of  points  that 
could  be  explained  by  a  rigid  transformation  would  be  searched  and  grouped  into  an  object.  Under 
the  framework  of  the  factorization  method  [16],  this  view  of  the  problem  is  followed  by  Boult 
and  Brown  [4]  and  Gear  [7],  where  the  role  of  rigidity  is  replaced  by  linear  dependence  between 
feature  tracks.  Since  the  factorization  produces  a  matrix  that  is  related  with  shape,  segmentation  is 
obtained  by  recursively  clustering  columns  of  feature  trajeetories  into  linearly  dependent  groups. 

This  paper  presents  a  new  method  for  segmenting  and  recovering  the  motion  and  shape  of 
multiple  independently  moving  objects  from  a  set  of  feature  trajectories  tracked  in  a  sequence  of 
images.  Developed  by  using  the  framework  of  the  factorization  by  Tomasi  and  Kanade  [16],  the 
method  does  not  require  any  grouping  of  features  into  an  object  at  the  image  level  or  prior  knowledge 
of  the  number  of  objects.  It  directly  computes  shape  information  and  allows  segmentation  into 
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objects.  This  has  been  made  possible  by  introducing  a  linear-algebraic  construct  of  object  shapes, 
called  the  shape  interaction  matrix.  The  entries  of  this  matrix  are  invariant  to  individual  object 
motions  and  yet  is  computable  only  from  tracked  feature  trajectories  without  knowing  their  object 
identities  (ie,  segmentation).  Once  the  matrix  is  computed,  transforming  it  into  the  canonical  form 
results  in  segmenting  features  as  well  as  recovering  the  shape  and  motion  of  each  object.  We  will 
present  our  theory  by  using  the  orthographic  camera  model.  It  is,  however,  easily  seen  that  the 
theory,  and  thus  the  method,  works  under  a  broader  projection  model  including  weak  perspective 
(scaled  orthography)  and  paraperspective  [12]  up  to  an  affine  camera  [10] 

In  the  next  section,  we  will  first  re-derive  the  factorization  method  in  homogeneous  coordinates 
including  the  translational  motion  component.  Then,  section  3  will  provide  geometrical  interpre¬ 
tation  of  the  matrices  involved  in  factorization,  which  will  be  useful  for  developing  the  multi-body 
factorization  method  in  section  4.  Finally  in  sections  5  and  6  we  present  experimental  results,  and 
discuss  implications  of  our  method. 
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2  Factorization  Method:  A  New  Formulation  Including  Trans¬ 
lation 

The  factorization  method  was  originally  introduced  by  Tomasi  and  Kanade[15,  16]  for  the 
case  of  single  object  motion.  The  core  of  the  method  is  a  procedure  based  on  singular  value 
decomposition  that  separates  a  matrix  of  measurements  into  the  product  of  two  matrices  which 
represent  the  shape  and  motion  of  an  object,  respectively.  The  method  does  not  need  any  prior 
assumptions  about  either  structure  or  motion. 

The  original  Tomasi-Kanade  formulation  addressed  the  case  of  a  moving  camera  observing  a 
static  scene.  In  this  section  we  will  reformulate  the  method  in  such  a  way  that  a  static  camera 
observes  a  scene  with  a  moving  object.  Also,  whereas  the  translation  component  of  motion  is  first 
eliminated  in  the  Tomasi-Kanade  formulation,  we  will  retain  that  component  in  our  formulation. 
Though  equivalent  for  the  single  object  case,  the  new  formulation  with  these  two  changes  simplifies 
some  of  the  representations  and  allows  for  easy  extension  to  the  case  of  multiple  moving  objects. 

2.1  World  and  Observations 

Let  us  assume  for  the  moment  that  a  static  camera  observes  a  single  moving  object.  To 
represent  the  situation  we  need  two  coordinate  systems:  a  moving  system  O  attached  to  the  object, 
and  a  static  system  C  attached  to  the  camera  as  shown  in  figure  1. 


Figure  1 :  This  figure  shows  the  camera  and  the  object  with  its  coordinate  system.  The  (unit) 
vectors  i,  j  define  the  image  plane  and  k  its  normal. 

Consider  a  point  pi  on  the  object.  Its  position  represented  in  the  camera  coordinate  system  at 
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instant  /  is  given  by  the  transformation. 


P/i  —  R/  Pi  +  ty. 


is  the  rotation  matrix  whose  rows  iyf  j/  =  [jxf  jy^  jz\  and  ky^  k^^^  are 

the  axes  of  the  camera  coordinate  frame  C  expressed  in  the  object’s  frame.  The  vector 


represents  the  position  of  the  object’s  coordinate  frame  at  instant  /  in  the  camera  frame.  The 
representation  (1)  can  be  simplified  if  we  use  homogeneous  coordinates, 

r  X  ■ 

L 1 

for  the  object’s  point.  In  the  homogeneous  coordinates,  equation  (1)  can  be  expressed  as 

c  _  [  P/i  1  _  [  1  [  P*  ]  ("S') 

1  ~  0ix3  1  1 


The  camera  is  modeled  as  an  orthographic  projection.  It  produces  the  image  by  projecting  a 
world  point  parallel  to  the  optical  axis  onto  the  image  plane.  The  image  of  point  p*  at  time  /  is 
then  given  by  the  first  two  elements  of  p^-: 


[  J  L  Jyj  \  ^y)  J 

The  object  moves  relative  to  the  camera  which  acquires  images.  In  the  sequence  we  track 
feature  points  from  frame  to  frame.  Suppose  that  we  track  N  feature  points  over  F  frames,  and 
that  we  collect  all  these  measurements  into  a  single  matrix 
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Each  row  of  W  lists  the  image  coordinates  w  or  of  all  the  feature  points  in  each  frame,  and  each 
column  represents  the  image  trajectory  of  one  feature  over  the  whole  image  sequence. 

Using  (7)  we  can  represent  W  as  the  matrix  product, 
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If  we  denote 
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as  the  motion  and  shape  matrices  respectively,  we  have  the  compact  representation 

W-MS. 


(9) 


(10) 


(11) 


(12) 


Compared  with  the  original  formulation  in  [16],  note  that  the  motion  matrix  contains  transla¬ 
tional  components  4^  and  ty^ .  In  the  original  formulation  they  were  eliminated  from  the  bilinear 
equation  corresponding  to  (9)  by  subtracting  the  means  of  the  measurements  beforehand.  That 
procedure  reflected  the  fact  that  under  orthography  the  translation  components  does  not  provide 
any  information  about  shape.  In  the  case  of  a  scene  with  multiple  independent  moving  objects,  the 
situation  is  not  the  same:  on  one  hand  the  translation  component  of  each  object  cannot  be  computed 
without  segmentation,  and  on  the  other  hand  the  translation  component  does  provide  information 
for  segmenting  the  multi-body  scene.  For  these  reasons,  we  include  the  translation  component  in 
our  formulation  of  the  factorization  method. 


2.2  Solution  for  Shape  and  Motion  by  Factorization 

We  have  derived  the  bilinear  relationship  (9)  by  modeling  the  imaging  process.  The  problem 
of  recovering  the  shape  and  motion  is  to  start  with  a  given  matrix  and  obtain  a  factorization  into 
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motion  matrix  M  and  shape  matrix  S.  By  simple  inspeetion  of  (9)  we  can  see  that  since  M  and  S 
can  be  at  most  rank  4,  W  will  be  at  most  rank  4.  In  real  situations  W  is  constructed  from  noisy 
measurements,  so  the  rank  of  W  can  be  higher  due  to  noise  in  the  feature  tracking.  Among  all  the 
possible  matrix  decompositions,  singular  value  decomposition  (SVD)  is  the  most  robust  and  the 
best  rank  revealing  of  all  [14]  to  approximate  W  by  a  rank-4  matrix.  With  SVD,  W  is  decomposed 
and  approximated  as: 

W  =  USV^.  (13) 


Matrix  S  =  <izat/((Ti ,  <72, 0-3 ,  <74)  is  a  diagonal  matrix  made  of  the  four  biggest  singular  values  which 
reveal  the  most  important  components  in  the  data.  Matrices  U  €  and  V  G  are  the  left 

and  right  singular  matrices  respectively,  such  that  U^U  =  V^V  =  I  (the  4x4  identity  matrix). 
By  defining, 

M  =  US2  (14) 

S  =  (15) 


we  have  the  two  matrices  whose  product  can  represent  the  bilinear  system  W.  However,^  this 
factorization  is  not  unique,  since  for  any  invertible  4x4  matrix  A,  M  =  MA  and  S  =  A“‘S  are 
also  a  possible  solution  because 

MS  =  (ma)  (a-'s)  =  MS  =  W.  (16) 

In  other  words,  the  singular  value  decomposition  (13)  provides  a  solution  both  for  shape  and  motion 
up  to  an  affine  transformation. 

The  exact  solution  can  be  computed,  using  the  fact  that  M  must  have  certain  properties.  Let  us 
denote  the  4  x  4  matrix  A  as  the  concatenation  of  two  blocks. 


A  =  [Anlat]  , 


(17) 


The  first  block  Ar  is  the  first  4x3  submatrix  related  to  the  rotational  component  and  the  second 
block  a*  is  a  4  X  1  vector  related  to  translation.  Now,  since 


M  = 


MA  = 


MAi?|Mat  , 


(18) 


we  can  impose  motion  constraints,  one  on  rotation  and  the  other  on  translation,  in  order  to  solve 
for  A. 


2.2.1  Rotation  Constraints 

Block  Ar  of  A,  which  is  related  to  rotational  motion,  is  constrained  by  the  orthonormality  of 
axes  vectors  ij  and  jj;  each  of  the  2F  rows  entries  of  matrix  MAr  is  a  unit  norm  vector  and  the 
first  and  second  set  of  F  rows  are  pairwise  orthogonal.  This  yields  a  set  of  constraints: 

ihiARA^mf  = 
mjAnAj^mJ  = 
riiiA/iA^mJ  = 
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(19) 

(20) 
(21) 


where  ih;,  liij  are  rows  i  and  j  of  matrix  M  for  i  =  1 . . .  F  and  j  =  F  +  1 . . .  2F .  This  is 
an  overconstrained  system  which  can  be  solved  for  the  entries  of  by  using  least  squares 

techniques,  and  subsequently  solving  for  A/j.  See  [16]  for  a  detailed  solution  procedure. 

2.2.2  Translation  Constraints 

In  orthography,  the  projection  of  the  3D  centroid  of  an  object  features  into  the  image  plane  is  the 
centroid  of  the  feature  points.  The  A  and  Y  position  of  the  centroid  of  the  feature  points  is  the 
average  of  each  row  of  W : 

N  ^  - 

=  Ms  =  [MAKlMaj]  ^  , 

where 

p^^Ep.  P't) 

is  the  centroid  of  the  object. 

The  origin  of  the  object’s  coordinate  system  is  arbitrary,  but  we  can  choose  to  place  it  at  the 
centroid  of  the  object,  that  is  p  =  0.  Then  it  follows  immediately  from  (23)  that 

w  =  Mai  (25) 

This  expression  is  also  an  overconstrained  system  of  equations,  which  can  be  solved  for  the  entries 
of  ai  in  the  least  square  sense.  The  best  estimate  will  be  given  by 

ai  =  (M^M)“'M^w  (26) 

=  (27) 

which  completes  the  computation  of  all  the  elements  of  matrix  A. 


(22) 

(23) 
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3  Geometrical  Interpretation  of  the  Factorization 

The  factorization  procedure  developed  in  the  previous  section  can  be  summarized  as  follows. 
Given  the  measurements  of  matrix  W,  compute  its  singular  decomposition  (13) 

W  =  USV^.  (28) 

This  gives  recovery  of  the  shape  and  motion  M  and  S  up  to  an  affine  transform.  Then,  by  using 
the  constraints  (19)-(20)  and  (27),  we  obtain  A  which  provides  a  unique  motion  and  shape  as 

W  =  MS  (29) 

S  =  A-’S  =  A-‘S2V^  (30) 

M  =  MA=:US^A.  (31) 


All  the  matrix  operations  involved  in  the  factorization  have  been  so  far  presented  from  a  pure 
numerical  and  algebraic  point  of  view.  It  is  insightful  to  give  a  geometric  interpretation  to  these 
matrices. 

Let  us  first  consider  the  right  singular  matrix  V^.  From  equation  (30)  we  see  that 

=  S-2AS.  (32) 


This  equation  reveals  the  fact  that  is  a  linear  transformation  of  the  shape.  This  transformation, 
produced  by  A  and  S,  is  done  in  such  a  way  that  the  resultant  V  is  orthonormal.  To  understand 
how  A  and  S  are  related  with  shape  we  need  to  introduce  a  few  geometric  concepts  first.  We  have 
previously  used  the  centroid  of  the  object; 


s 


HXn  ■ 

1 

EK 

P 

N 

E^n 

1 

N 

(33) 


The  centroid  is  the  first-order  moment  of  a  set  of  points.  The  second  order  moments  of  a  set  of 
points  is  given  in  homogeneous  coordinates  by 
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(36) 


The  matrix  A  consists  of  a  submatrix  Ao  and  the  centroid  vector  p.  The  submatrix  Ao  is  the 
matrix  of  the  moments  of  inertia  of  the  object.  Its  eigenvectors  represent  the  directions  of  the  three 
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symmetrical  axes  of  the  ellipsoid  of  inertia.  Using  equation  (30),  we  can  show  that  the  matrix  A  is 
written  as 

A  =  (a-‘E2V^)  (vE2A-^)  (37) 

=  A-'SA-^.  (38) 

Similarly  to  the  moments  of  an  object,  we  can  think  of  the  moments  of  motion.  Vectors  if 
and  j;  of  the  motion  matrix  represent  the  X  and  Y  axes  of  the  camera  (ie.,  image  plane)  in  object 
coordinates.  As  the  object  moves  these  vectors  describe  trajectories  in  the  unit  sphere.  The  second 
order  moments  of  motion  vectors  can  be  defined  as 


=  M^M 

-  2F  [  * 

—  II  ||2 

.  ^  11  ^  lls  . 

+ 


/  ■ 


(39) 

(40) 

(41) 


The  submatrix  fio  is  the  matrix  of  the  "moments  of  inertia"  of  the  image  plane  motion  axes.  The 
term  t  is  the  average  position  of  the  camera  origin  in  the  object’s  coordinate  system  and  ||  t  | Is  the 
average  of  the  norm  of  the  translation  vector. 

Due  to  (31)  we  can  derive  an  equation  for  Q,  similar  to  (38) 


n  =  A^SA 


(42) 


The  bilinearity  of  the  observations  is  reflected  in  the  second-order  motion  moments,  too.  By 
multiplying  the  motion  moment  (42)  and  shape  moment  (38),  we  have 

fiA  =  A^S^A-^.  (43) 


or 

nAA^  =  A^S^  (44) 

This  is  a  standard  form  of  a  4  x  4  eigensystem,  where  is  the  diagonal  matrix  of  the 
eigenvalues  and  A^  the  matrix  of  the  eigenvectors.  The  square  of  the  singular  values  af  of 
the  measurements  matrix  W  are  the  eigenvalues  of  the  product  of  motion  and  shape  moment 
matrices,  and  their  eigenvectors  form  the  rows  of  the  transformation  matrix  A.  Geometrically,  the 
eigenvectors  represent  space  orientation,  resulting  from  projecting  the  symmetry  axes  of  motion 
into  the  symmetry  axes  of  shape.  The  eigenvalues  (thus  singular  values  of  W)  represent  object’s 
"lengths"  multiplied  by  motion  moments. 
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4  The  Multi-body  Factorization  Method 

So  far  we  have  assumed  that  the  scene  contains  a  single  moving  object.  If  there  is  more  than 
one  moving  object,  the  measurement  matrix  W  will  contain  features  (columns)  which  originate 
from  different  motions.  One  may  think  that  solving  the  problem  requires  first  sorting  the  columns 
of  the  measurements  matrix  W  into  submatrices,  each  of  which  contains  features  solely  from  one 
object,  so  that  the  factorization  technique  of  the  previous  sections  can  be  applied  individually.  We 
will  show  in  this  section  that  the  multi-body  problem  can  be  solved  without  prior  segmentation. 
For  the  sake  of  simplicity  in  presentation  we  will  present  the  theory  and  method  for  the  case  of  two 
bodies,  but  it  will  be  clear  that  the  method  is  applicable  to  the  general  case  of  an  arbitrary  unknown 
number  of  objects. 

4.1  Multi-body  Motion  Recovery  Problem:  Its  Difficulty 

Suppose  we  have  a  scene  in  which  two  objects  are  moving  and  we  take  an  image  sequence  of 
F  frames.  The  relevant  coordinate  systems  in  this  case  are  depicted  in  figure  2.  Suppose  also  that 


Figure  2:  Two  bodies:  The  coordinate  systems 

the  set  of  features  that  we  have  observed  and  tracked  in  the  image  sequence  actually  consists  of 
feature  points  from  object  1  and  Nz  from  object  2  which  are  observed  in  an  image  sequence  of  F 
frames. 

Imagine  for  the  moment  that  somehow  we  know  the  classification  of  features  and  thus  could 
permute  the  columns  of  W  in  such  a  way  that  the  first  columns  belong  to  object  1  followed  by 
the  N2  columns  from  object  2.  Matrix  W  would  have  the  canonical  form: 

W*  =  [  Wi  1  W2  ]  .  (45) 

Each  measurement  submatrix  can  be  factorized  as 

W,  =  U/S,vf  (46) 
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=  M(S,  =  (M/AO(Ar‘S/)  (47) 

with  /  =  1  and  2  for  object  1  and  2  respectively.  Equation  (45)  now  has  the  canonical  factorization; 


W*  =  [MilMj 

=  [U,|U2] 

By  denoting 

M*  =  [M1IM2] ,  S*  = 


- 1 

M 

“‘tOI-* 

0 

_ 1 

> 

0 

A7*  0 

1 

0 

Th_ 

w 

1 _ 

1 - 

0 

> 

1 _ 

_  0  si . 

0  A2 

0  A2 ' 

0  S2^'  . 

0 

<! 

(48) 

(49) 


Si  0 

0  S2 


,  A*  = 


Ai  0 
0  A2 


(50) 


r  Si  0  ’ 

\yT  0  1 

U*  =  [Ui|U2],  s*  = 

0  S2 

,  = 

0 . 

< 

1 _ 

(51) 


we  express  a  factorization  in  a  similar  way  to  a  single  object  case,  where  the  canonical  measurement 
matrix  relates  to  shape  and  motion  by: 


w* 

=  M*S* 

(52) 

s* 

(53) 

M* 

=  U*S*^A* 

(54) 

From  equation  (48),  we  see  that  W*  (and  therefore  W)  will  have  at  most  rank  8,  since  W 1  and 
W2  are  at  most  rank  4.  Let  us  consider  for  the  remainder  of  this  paper  the  non-degenerate  case 
where  the  rank  of  W  is  in  fact  equal  to  8;  that  is,  the  object  shape  is  actually  three-dimensional 
(not  planar  or  line)  and  the  motion  vectors  span  3D  for  both  objects.  The  degenerate  cases  will  be 
briefly  touched  in  the  last  section  and  are  discussed  in  more  detail  in  [6]. 

In  reality,  we  do  not  know  which  features  belong  to  which  object,  and  thus  the  columns  of  the 
given  measurement  matrix  W  are  a  mixture  of  features  from  object  1  and  2.  We  can  still  apply 
singular  value  decomposition  (SVD)  to  the  measurement  matrix,  and  obtain 

W  =  USV^.  (55) 

Then  it  may  appear  that  the  remaining  task  is  to  find  the  linear  canonical  transformation  A*  in  (50) 
such  that  shape  and  motion  will  have  the  block  structure  of  equations  (53)  and  (54). 

There  is,  however,  a  fundamental  difficulty  in  doing  this.  The  metric  (rotation  and  translation) 
constraints  (eq.(19)-(20)  and  (25)-(27))  were  obtained  in  section  2.2  by  considering  that  the  motion 
matrix  for  one  object,  that  is,  by  assuming  that  the  measurement  matrix  consists  of  features  from 
a  single  object.  Those  constraints  are  therefore  applicable  only  after  knowing  the  segmentation. 
This  is  exactly  the  mathematical  manifestation  of  the  cyclic  dilemma  mentioned  earlier. 
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Faced  with  this  difficulty,  a  usual  approach  would  be  to  group  features  bit  by  bit  so  that  we 
segment  W  into  two  rank-4  matrices  and  obtain  the  factorization  of  the  form  (48).  For  example,  a 
most  simplistic  procedure  would  be  like  the  following.  Pick  the  first  four  columns  of  W  and  form 
a  rank-4  subspace.  If  the  fifth  column  belongs  to  the  subspace  (ie.  is  linear  dependent  on  the  first 
four,  or  "almost"  linear  dependent  in  the  ease  of  noisy  measurement),  then  classify  it  to  the  same 
object  as  the  first  four  columns  and  update  the  subspace  representation.  Otherwise,  it  belongs  to 
a  new  object.  Apply  this  procedure  recursively  to  all  the  remaining  columns.  This  approach  is  in 
fact  essentially  the  one  used  by  [4]  and  [7]  to  split  matrix  W,  and  similar  to  what  was  suggested 
by  Ullman  [17],  whose  criteria  for  merging  was  local  rigidity. 

There  are  a  few  disadvantages  in  this  cluster-and-test  approach.  First,  there  is  no  guarantee  that 
the  first  four  columns,  which  always  form  a  rank-4  subspace,  are  from  the  same  object.  Second,  if 
we  use  a  sequential  procedure  like  the  one  above  or  its  variation,  the  final  result  is  dependent  on 
where  we  start  the  proeedure,  and  alternatively,  the  search  for  the  globally  optimal  segmentation 
most  likely  will  be  computational  very  expensive.  Finally,  the  prior  knowledge  of  the  number  of 
objects  becomes  very  critical,  since  depending  on  the  decision  criteria  of  subspace  belongingness 
the  final  number  of  objects  may  vary  arbitrarily.^ 


4.2  Mathematical  Construct  of  Shapes  Invariant  to  Motions 

The  main  difficulty  in  the  multi-body  structure-from-motion  problem  revealed  above  is  that 
shape  and  motion  interact.  Mathematically,  the  equation  (48)  indicates  that  the  rank-8  measurement 
space  is  originally  generated  by  the  two  subspaces  of  rank  4  each,  represented  by  the  block- 
diagonal  shape  matrix  S*.  However,  the  recovered  shape  space  V^,  obtained  by  the  singular  value 
decomposition  of  the  non-canonical  W,  is  in  general  a  linear  combination  of  the  two  subspaces  and 
has  lost  the  block-diagonal  structure. 

There  is  however  a  mathematical  construct  that  preserves  the  original  subspace  structure.  Let 
us  define  Q  as  {Ni  -f  N2)  x  {Ni  +  N2)  square  matrix 

Q  =  VV^.  (56) 

We  will  call  this  matrix  the  shape  interaction  matrix.  Mathematically,  it  is  the  orthogonal  operator 
that  projects  N  =  {N]  +  N2)  dimensional  vectors  to  the  subspace  spanned  by  the  columns  of 
V.  This  matrix  Q  has  several  interesting  and  useful  properties.  First,  by  definition  it  is  uniquely 
computable  only  from  the  measurements  W  without  knowing  the  segmentation,  since  V  is  uniquely 
obtained  by  the  singular  value  decomposition  of  W. 

Secondly,  permuting  columns  of  W  does  not  change  the  set  of  values  {Qij}  that  appear  in  Q 
though  their  arrangement  in  Q  does;  swapping  columns  I  and  m  of  W  results  in  swapping  columns 
I  and  m  of  V^.  Therefore  it  results  in  simultaneously  swapping  columns  I  and  m  and  rows  I  and 
m  in  Q,  but  not  their  entry  values. 

'While  this  is  beyond  the  scope  of  the  assumption  in  this  section,  this  cluster-and-test  approach  also  requires  the 
prior  knowledge  of  the  ranks  of  objects  as  well.  Since  for  example  a  rank-8  measurement  matrix  might  have  been 
generated  by  two  line  (rank-2)  objects  and  one  full  3D  (rank  4)  object  instead  of  two  full  3D  objects,  and  therefore 
committing  to  find  two  rank-4  subspaces  might  be  wrong. 
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Thirdly,  each  element  of  Q  provides  important  information  about  whether  a  pair  of  features 
belong  to  the  same  object.  Since  the  set  of  values  do  not  change,  let  us  compute  Q*,  the  shape 
interaction  matrix  for  the  canonical  measurement  matrix  W*.  By  substituting  (53)  into  (56),  we 
obtain 


Q*  = 

_  a.*  S* 


-T  \  —  1  c  * 


)-‘S* 


=  s* 


(A.-i5.*-i/2y*r)(v*5]*-i/2A*-3^)|  '  s*  =  S*^(S*S*^)“'S 


0 

0  S2^ 


Ar^  0 

0  A2“^ 


Si  0 

0  S2 


Si^Ar'Si 

0  S2"A2"'S2 


0 

r*  -i( 


(57) 

(58) 

(59) 

(60) 
(61) 

(62) 


where  Ai  and  A2  are  the  4  x  4  matrices  of  the  moments  of  inertia  of  each  object.  This  means  that 
the  canonical  Q*  matrix  for  the  sorted  W*  has  a  very  defined  block-diagonal  structure.  Moreover, 
each  entry  has  the  value 


s?;.Ar'  sij  if  feature  trajectory  i  and  j  belong  to  object  1 

S2]A2“*  S2j  if  feature  trajectory  i  and  j  belong  to  object  2  (63) 

0  if  feature  trajectory  i  and  j  belong  to  different  objects. 


Finally  and  most  importantly,  the  set  of  values  {Q*j},  which  is  the  same  as  {Qij]  are  invariant 
to  motion.  This  is  true  since  equations  (63)  include  only  S’s,  and  not  M.  In  other  words,  in 
whatever  way  the  objects  move  they  will  produce  the  same  set  of  entries  in  matrix  Q. 

In  summary,  we  have  shown  that  without  knowing  the  segmentation  of  features  we  can  compute 
matrix  Q  whose  element  Qij  can  be  interpreted  as  a  measure  of  the  interaction  between  feature  i 
and  j:  if  the  value  is  non  zero,  they  belong  to  the  same  object,  and  if  they  don’t  belong  to  the  same 
object,  the  value  is  zero.  Also,  if  the  features  are  sorted  correctly  into  the  canonical  form  of  the 
measurement  matrix  W*,  then  the  corresponding  canonical  shape  interaction  matrix  Q*  must  be 
block  diagonal. 


4.3  Sorting  Matrix  Q  into  Canonical  Form 

The  problem  of  segmenting  and  recovering  motion  of  multiple  objects  now  has  reduced  to 
sorting  the  entries  of  matrix  Q  by  swapping  pairs  of  rows  and  columns  until  it  becomes  block 
diagonal.  Once  achieved,  the  corresponding  permutations  of  columns  of  W  will  transform  it 
its  canonical  form  where  features  from  one  object  are  grouped  into  contiguous  columns.  This 
relationship  between  sorting  Q  and  permuting  W  is  illustrated  in  figure  3. 
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Figure  3:  Segmentation  process 

With  noisy  measurements,  a  pair  of  features  from  different  objects  may  exhibit  a  small  non-zero 
entry  in  Q.  We  can  regard  Q?  as  representing  the  energy  of  the  shape  interaction,  and  the  block 
diagonalization  of  Q  can  be  achieved  by  minimizing  the  total  energy  of  all  possible  off-diagonal 
blocks  over  all  set  of  permutations  of  rows  and  columns  of  Q.  We  found  that  a  simple  iterative 
minimization  procedure  suffices  for  our  purpose.  Alternatively,  we  can  regard  matrix  {Q^j}  as 
defining  a  graph  of  Ni  +  N2  nodes,  where  the  Q^j  indicates  the  weight  of  the  link  {i,j).  We  also 
found  that  graph-theoretical  algorithms,  such  as  the  minimum  spanning  tree,  can  be  used  to  achieve 
the  block  diagonalization  more  efficiently  than  the  energy  minimization.  The  detailed  procedures 
are  presented  in  [6]. 

4.4  Summary  of  Algorithm 

While  we  have  presented  the  theory  for  the  case  of  two  full-3D  objects,  it  is  easy  to  see  tht  its 
essential  part  holds  for  more  general  cases.  First  the  matrix  Q*  has  the  block  diagonal  structure 
for  an  arbitrary  number  of  moving  objects,  that  is,  an  entry  Qij  of  the  Q  matrix  equals  to  zero  if 
features  i  and  j  belong  to  different  objects.  Furthermore,  this  property  holds  even  when  the  shape 
matrix  of  the  objects  has  rank  less  than  4  (planes  and  lines).  The  computation  of  Q  by  (56)  requires 
only  the  knowledge  of  the  total  rank  of  W,  which  we  can  determine  by  SVD.  Finally  once  Q*  is 
obtained,  instead  of  permuting  columns  of  W  we  can  use  the  equivalent  permutation  of  V^,  since 
it  is  more  computationally  efficient. 
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The  whole  algorithm  of  the  multi-body  factorization  method  is  now  summarized  as: 

1 .  Extract  and  track  features  in  the  input  image  sequence  and  create  matrix  W 

2.  Computer  =  rank{W) 

3.  Decompose  matrix  W  using  SVD 

4.  Compute  shape  interaction  matrix  Q  using  the  first  r  rows  of 

5.  Block-diagonalize  Q 

6.  Permute  matrix  into  submatrices,  each  corresponding  to  a  single  object 

7.  Compute  Ai  for  each  object,  and  thus  its  shape  and  motion. 
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5  Experiments 

We  will  present  two  sets  of  experiments  to  demonstrate  how  the  algorithm  works.  The  first  is 
an  experiment  with  synthetically  generated  feature  trajectories,  and  the  second  with  those  extracted 
from  real  images  taken  in  the  laboratory  under  controlled  imaging  conditions. 

5.1  Synthetic  Data 

Figure  4  shows  the  3D  synthetic  scene.  It  contains  three  transparent  objects  in  front  of  each  other 
moving  independently.  A  static  camera  takes  100  images  during  the  motion.  The  closest  object  to 
the  camera  is  planar  (rank  3)  and  the  other  two  are  full  3D  objects  (rank  4).  So  this  is  in  fact  a  shape- 
degenerate  case.  Each  object  translates  slightly  and  rotates  over  its  own  centroid  in  such  a  way  that 
the  features  of  all  objects  are  completely  intermingled  in  the  image  plane.  This  complication  is 
intentionally  introduced  in  order  to  demonstrate  the  fact  that  our  motion  segmentation  and  recovery 
method  does  not  use  any  local  information  in  the  images.  One  hundred  and  eighteen  (118)  points 
in  total  on  three  objects  are  chosen:  33  features  from  the  first  object,  49  from  the  second,  and  36 
from  the  third.  Figure  5  (a)  illustrates  the  actual  3D  motions  of  those  118  points. 


Figure  4:  Synthetic  scene.  Three  objects  move  transparently  with  arbitrary  motion 

The  projections  of  118  scene  points  onto  the  image  plane  during  the  motion,  that  is,  the 
simulated  trajectories  of  tracked  image  features,  are  shown  in  figure  5(b)  with  a  different  color  for 
each  object.  Independently  distributed  Gaussian  noise  with  one  pixel  of  variance  was  added  to 
the  image  feature  positions  for  simulating  errors  in  feature  tracking.  Of  course,  the  identities  of 
the  features  are  assumed  unknown,  so  the  measurement  matrix  created  by  randomly  ordering  the 
features  was  given  to  the  algorithm. 

Figure  6(a)  shows  the  shape  interaction  matrix  Q:  the  height  is  the  square  of  the  entry  value. 
The  result  of  sorting  the  matrix  into  a  bloackdiagonal  form  is  shown  in  figure  6(b).  We  can  observe 
the  three  blocks  corresponding  to  objects  3,  2  and  1;  all  of  the  1 18  features  are  correctly  classified. 

Figures  7(a)  (b)  and  (c)  show  one  view  of  each  of  the  recovered  shapes  of  the  three  objects  in 
the  same  order  as  figure  4.  Figure  7(c)  showing  the  planar  object  viewed  from  its  edge  indicates 
the  correct  recovery  of  its  shape. 
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Figure  7:  Recovered  shape  of  the  objects 


5.2  Laboratory  Data 

The  laboratory  scene  consists  of  two  roughly  cylindrical  shapes  made  by  rolling  cardboard  and 
drawing  dots  on  the  surface.  The  cylinder  on  the  right  tilts  and  rotates  in  the  plane  parallel  to  the 
image  plane  while  the  cylinder  on  the  left  hand  side  rotates  around  its  axis.  The  85  images  were 
taken  by  a  camera  equipped  with  a  telephoto  lens  to  approximate  orthographic  projections,  and 
lighting  was  controlled  to  provide  the  best  image  quality.  In  total,  55  features  are  detected  and 
tracked  throughout  the  sequence:  27  coming  the  left  cylinder  and  28  from  the  other,  while  the 
algorithm  was  not  given  that  information. 


Figure  8:  Image  of  the  objects  and  feature  tracks 

Figure  8  shows  the  85-th  frame  in  the  sequence  with  the  tracks  of  the  selected  features  super¬ 
imposed.  The  scene  is  well  approximated  by  orthography  and  the  tracking  was  very  reliable  due 
to  the  high  quality  of  the  images. 
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Figure  10:  The  recovered  shape  of  the  two  cylinders 


Figure  9(a)  show  the  shape  interaction  matrix  Q  for  the  unsorted  input  features.  The  sorted 
block  diagonal  matrix  Q*  is  shown  in  figure  9(b),  and  the  features  are  grouped  accordingly  for 
individual  shape  recovery.  The  resultant  three-dimensional  points  are  displayed  in  figure  10  with 
linearly  interpolated  surface  in  order  to  convey  a  better  perception  of  the  their  shape. 


20 


6  Discussion  and  Conclusion 

In  this  paper  we  have  shown  that  the  problem  of  multi-body  structure-from-motion  problem  can 
be  solved  systematically  by  using  the  shape  interaction  matrix.  The  striking  fact  is  that  the 
method  allows  for  segmenting  or  grouping  image  features  into  separate  objects  based  on  their 
shape  properties  without  explicitly  computing  the  individual  shapes  themselves.  Also,  no  prior 
knowledge  of  the  number  of  moving  objects  in  the  scene  is  assumed  in  the  algorithm. 

This  is  due  to  the  interesting  and  useful  invariant  properties  of  the  shape-interaction  matrix  Q. 
We  have  shown  that  Q  is  motion  invariant.  Even  when  the  matrix  is  computed  from  a  different  set 
of  image-level  measurements  W  generated  by  a  different  set  of  motions  of  objects,  its  entries  will 
remain  invariant.  Each  entry  has  the  same  unique  value  independently  of  the  trajectories  of  its  own 
and  other  object.  The  motion  invariance  property  of  Q  means  also  that  the  degree  of  complexity 
of  the  solution  is  dependent  on  the  scene  complexity,  but  not  on  the  motion  complexity. 

The  shape  interaction  matrix  Q  is  also  invariant  to  the  selection  of  individual  object  coordinate 
frames.  We  can  easily  see  that  by  considering  transforming  object  k's  shape  by  a  homogeneous 
transform  4x4  matrix  T, 

S',  =  TS,.  (64) 

The  corresponding  block-diagonal  element  matrix  of  Q*  will  be 

s'r(s',s'r)-'s',  =  (TS,)^  (rSkSlT^y  (TS,)  =  sf(s,sn-‘s,  (65) 

and  therefore  the  entries  of  the  shape  interaction  matrix  remains  the  same. 

Another  interesting  fact  is  that  the  shape  interaction  matrix  can  handle  many  degenerate  cases 
as  well,  where  one  or  more  objects  may  not  be  a  full  3-D  object,  but  a  line  or  a  planar  object.  The 
synthetic  example  in  section  5  was  in  fact  a  degenerate  case  where  a  planar  object  was  included. 
More  research  is  required  for  the  degenerate  cases  including  the  cases  where  the  motions  are 
degenerate.  Also,  in  order  to  achieve  robustness  under  the  presence  of  noise  we  need  to  relate  the 
noise  level  with  the  thresholds  necessary  in  some  of  the  decision  making  processes.  They  include 
the  identification  of  the  rank  of  the  measurement  matrix  in  the  singular  value  decomposition,  and 
the  determination  of  block-diagonality  in  sorting  the  shape  interaction  matrix.  The  report  [6] 
explores  some  of  those  issues. 
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